Course: Visual Analytics for Policy and Management

Final Project: Broadband Internet Access in Stevenson, WA


Table of Contents:

  1. Background

  2. Univariate plots

  3. Bivariate plots

  4. Multivariate plots

  5. Conclusion

Background

Stevenson, WA, is a small town of about 1500 people in the Portland Metropolitan Region. Although the small city features various natural and cultural amenities, many of its residents, businesses, and organizations have expressed dissatisfaction with local internet services. In February 2018, Stevenson City Council directed city staff to pursue the creation of a strategic plan regarding broadband infrastructure.

Broadband is high-speed internet, or internet with minimum speeds of 25 megabits per second (mbps) downloading and 3 mbps uploading, according to the US Federal Communications Commission (FCC). One of our team members, Kimberly Pearson, when interning at the City of Stevenson as the Broadband Project Coordinator, created a survey to assess current internet quality and uptake for busiesses and organizations. After sending the survey to over 150 small and medium sized businesses within city limits, 40 businesses responded, a ~26% response rate. The 40 survey questions refer to the business’ chosen Internet Service Provider, the cost of their internet service, the speeds they subscribe to, their actual experienced speeds, their satisfaction with their internet plan, and more. The results were then visualized in Tableau (see plots below).

Original plots: https://public.tableau.com/profile/kimberly.pearson#!/vizhome/StevensonBusinessSurvey/Story1

Within this report, we would like to explore the following research questions as they relate to each type of plot.

1) Univariate

For our univariate plots, we wanted to examine the range of download and upload speeds offered by each Internet Service Provider (ISP) answer the following questionS:

  1. “Which is the range of download speeds provided by each ISP?”

  2. “Which is the range of upload speeds provided by each ISP?”

We utilized two different box plots to answer these questions.

2) Bivariate

For our bivariate plots, we wanted to examine the “advertised download speed” of each ISP and the recorded “download speed” of each ISP to answer the following questions:

  1. “Which ISP most accurately advertised their download speeds?”

  2. “Which ISP most accurately advertised their upload speeds?”

We utilized two scatterplots to answer these questions.

3) Multivariate

For our first multivariate plot, we wanted to examine the relationship between “subscribed speed”, “download speed”, “upload speed”, and “satisfaction”. We specifically wanted to answer the following question:

  1. “Is there a relationship between customer satisfaction with Internet speed and download speed, upload speed and subscribed speed?
  2. If not, what variables are correlated to overall customer satisfaction with Internet services?

We created two regression models and used a dot and whisker plot to answer our first question.

To respond our second question we used a correlation plot.

  • Map of Washington Fixed Broadband

Go to table of contents.

Univariate Plots

Our univariate plots are boxplots. A boxplot is a graph that provides a good indication of how the values in the data are spread out. Boxplots are a standardized way of displaying the distribution of data based on a five number summary (the minimum value, first quartile (Q1), the median, third quartile (Q3), and the maximum value). They also show the outliers.

Research Questions
  1. “Which is the range of download speeds provided by each ISP?”
  2. “Which is the range of upload speeds provided by each ISP?”

We expect to find similar Internet speed ranges across all providers.

Explanation

We created the plots using the following steps. Additionally, notes for each of the steps are included in the code.

  1. Load the data
  2. Call GGPlot2 from library
  3. Create a base boxplot with the relevant variables (ISP and Download.Speed or ISP and Upload.Speed)
  4. Add relevant color and the flip the axes
  5. Include a title, data source, and label for each of the axes
  6. Edit the location, size, color, and type style of each text element using the (theme) function
  7. Edit the scale and tick marks of the relevant axes
  8. Visualize our plot
# Here we load the dataset. 
link='https://github.com/marcevl/Team-Assignments/raw/master/Internet.RData'
load(file=url(link))

# Then, we call the GGPLOT2 library. If GGPLOT2 has not been installed yet, go to the bottom right quadrant of your screen, select "Packages", click install, then select ggplot2 
library(ggplot2)

# We create a box plot for each ISP, illustrating the ranges of their dowload speeds. 
box1 = ggplot(Internet, aes(x = ISP, y=Download.Speed))

# We add information regarding the colors of the plot, and then flip it so the axes are reversed. 
box2 = box1 + geom_boxplot(color='black',fill='white',
                           outlier.color='black',
                           outlier.size=2) + coord_flip()
                           
# Here, we add a title, caption, and change the labels on each axis. 
box3 = box2 + labs(title='Wave provides highest range of download speeds', subtitle = 'Stevenson, WA',
                   y = 'Download Speed (Megabytes/Second)',
                   x = NULL,
                   caption = 'Source: Stevenson Business and Organization Internet Survey, 2018 
                   Note: Wave Outlier of 500 Mbps Omitteed')

# Then, we format the location, size, color, and type style of each text element. 
box4 = box3 + theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
                    plot.title=(element_text(face='bold',
                                            size=15,
                                            hjust=0.5)),
                    plot.subtitle = (element_text(hjust = 0.5)),
                    plot.caption=element_text(size=8, vjust = -2),
                    text=element_text(color='gray25', size = 13),
                    panel.background = element_rect(fill='white'),
                    axis.ticks.y=element_blank(),
                    axis.line.x=element_line(color='black'))

# Finally, we change the scale of each axis so that we show the exact number of tick marks that we'd like. 
box5 = box4 + scale_y_continuous(breaks=seq(0,100, 10),
                                         limits=c(0,100)) + scale_x_discrete(limits=c('Wave','GorgeNet','CenturyLink'))

# Then, we "print" the vizualization. 
box5

The boxplot shows that GorgeNet has the least spread out download speed distribution. It means that there is not a huge variation in download speeds provided to GorgeNet’s customers. In contrast, Wave provides the highest range of download speeds, with some customers receiving less than 10 Mbps and some others more than 50 Mbps. Century Link has a small download speed range, too; however, the plot indicates the presence of an outlier.

# We created a second box plot for each ISP, illustrating the ranges of their upload speeds. 
box1 = ggplot(Internet, aes(x= ISP, y = Upload.Speed)) 

# We add information regarding the colors of the plot, and then flip it so the axes are reversed. 
box2 = box1 + geom_boxplot(color='black',fill='white',
                           outlier.color='black',
                           outlier.size=2) + coord_flip()

# Here, we add a title, caption, and change the labels on each axis.
box3 = box2 + labs(title='Wave provides highest range of upload speeds', subtitle = 'Stevenson, WA',
                   y = 'Upload Speed (Megabytes/Second)',
                   x = NULL,
                   caption = 'Source: Stevenson Business and Organization Internet Survey, 2018')

# Then, we format the location, size, color, and type style of each text element. 
box4 = box3 + theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
                    plot.title=(element_text(face='bold',
                                            size=15,
                                            hjust=0.5)),
                    plot.subtitle = (element_text(hjust = 0.5)),
                    plot.caption=element_text(size=8, vjust=-2),
                    text=element_text(color='gray25', size = 13),
                    panel.background = element_rect(fill='white'),
                    axis.ticks.y=element_blank(),
                    axis.line.x=element_line(color='black'))

# Finally, we change the scale of each axis so that we show the exact number of tick marks that we'd like. 
box5 = box4 + scale_y_continuous(breaks=seq(0,100, 10),
                                         limits=c(0,100)) + scale_x_discrete(limits=c('Wave','GorgeNet','CenturyLink'))

# Then, we "print" the visualization. 
box5

The results of this plot are consistent with the previous one. Wave provides the highest range of upload speeds, while Century Link provides the lowest range but there is an outlier. GogeNet also provides a small range of variation in upload speeds. When we compare both plots, we can conclude that, in general, download median speeds are higher than upload median speeds and, in contrast with the other to ISP, Wave provides a high range of Internet speeds.


Go to table of contents.

Bivariate plots

For our bivariate plots, we decided to create scatterplots. Scatterplots use dots to represent the values for two different variables showing the relationship between them. In our case, we wanted to compare customers’ subscribed and actual speeds.

Research Questions

“Which ISP most accurately advertised their download speeds?” “Which ISP most accurately advertised their upload speeds?”

We expected to find a linear relationship between both variables, advertised and actual Internet speeds. This would mean that the advertised speed corresponds to the actual speed provided.

Explanation

We can create the plots using the following steps. Additionally, notes for each of the steps are included in the code.

  1. First create a factor for ISP using the existing data
  2. Then create a dataframe with just the relevant factors. These include subscribed speed, download speed, upload speed, and ISP
  3. Subset the data to only include the ISPs of interest - Wave, CenturyLink, and GorgeNet
  4. Create a baseplot of Subscribed Speeds against Upload Speeds
  5. Add a scatterplot element
  6. Add a reference line to measure against the Upload Speed against
  7. Separate the plots by ISP using the facetwrap command
  8. Modify the scale for each axis
  9. Modify the title, subtitle, data source, and labels for each axes
  10. Utilize the BW formatting preset
  11. Format the text elements
  12. Visualize our plot
# For our second plots, we create a factor, ISP, with specific levels. It is based off the original variable. 
ISP <- factor(Internet$ISP)
str(ISP)
##  Factor w/ 6 levels "AT&T","CenturyLink",..: 2 3 6 2 3 2 1 6 6 3 ...
levels(ISP)
## [1] "AT&T"        "CenturyLink" "GorgeNet"    "Integra"     "None"       
## [6] "Wave"
# Then, we create a dataframe with just the factors that we are concerned with in these plots. It contains information for all of the ISPs. 
df1 <- data.frame(ISP, Internet$Subscribed.Speed, Internet$Download.Speed, Internet$Upload.Speed)
names(df1) <- c("ISP", "Subscribed", "Download", "Upload")

# Next, we subsetted the data to only include Wave, CenturyLink, and GorgeNet. 
df2 <- subset(df1, ISP %in% c("Wave", "CenturyLink", "GorgeNet"))

# We then created a base plot of Subscribed Speeds versus Upload Speeds. 
base = ggplot(df2, aes(x=Subscribed, y=Upload))

# We added a scatterplot layer. 
box1 = base + geom_point()

# Then, a reference line that represents the subscribed speeds. 
box2 = box1 + geom_abline(a=0,b=45,lty=2,lwd=0.5, alpha = 0.5)

# We used the command facet wrap to separate the plot into three plots, one for each ISP. 
box3 = box2 + facet_wrap( ~ ISP ,nrow = 1)

# Then, we added a scale for the y axis and a scale for the x axis. 
box4 = box3 + scale_y_continuous(breaks=c(0,0,40,60,80,100,120),
                                 limits = c(0, 120))

box5 = box4 + scale_x_continuous(breaks=c(0,30,60,90,120,150),
                                 limits = c(0, 150))

# We coded for the title. 
Title = 'GorgeNet Provides the Most Accurate Advertised Upload Speeds'

# We added the title, subtitle, x-axis label, y-axis label, and the caption. 
box6 = box5 + labs(title=Title, subtitle = 'Stevenson, WA',
                   x= 'Suscribed Speed (Megabytes/Second)',
                   y = 'Upload Speed (Megabytes/Second)',
                   caption='Source: Stevenson Business and Organization Internet Survey,2018') + theme_bw()

# We formatted for each text element, as before. 
box7 = box6 + theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
                    plot.title=(element_text(face='bold',
                                            size=15,
                                            hjust=0.5)),
                    plot.subtitle = (element_text(hjust = 0.5, vjust = 2)),
                    plot.caption=element_text(size=8, vjust=-2),
                    text=element_text(color='gray25', size = 11),
                    axis.line.y=element_line(color = 'black'),
                    axis.line.x=element_line(color='black'),
                    axis.text=element_text(size=8))

# Finally, we "printed" the faceted scatterplot.
box7

The plot shows that any ISP accurately advertises its upload speeds. Century Link and GorgeNet provide slower upload speeds than the ones they publicize. In most cases, Wave’s actual upload speeds are slower than the advertised ones, but in our sample, there were two cases at which the upload speed provided exceeded the advertised speed.

# The instructions to create the faceted scatterplot are the same as in the past plot. 
base = ggplot(df2, aes(x=Subscribed, y=Download))

box1 = base + geom_point()

box2 = box1 + geom_abline(a=0,b=45,lty=2,lwd=0.5, alpha = 0.5)

box3 = box2 + facet_wrap( ~ ISP ,nrow = 1)

box4 = box3 + scale_y_continuous(breaks=seq(0,120,20),
                                 limits = c(0,120))

box5 = box4 + scale_x_continuous(breaks=seq(0,120,40),
                                 limits = c(0,120))

Title = 'GorgeNet Provides the Most Accurate Advertised Download Speeds'

box6 = box5 + labs(title=Title, subtitle = 'Stevenson, WA',
                   x= 'Suscribed Speed (Mbps)',
                   y = 'Download Speed (Mbps)',
                   caption='Source: Stevenson Business and Organization Internet Survey, 2018
                   Note: Wave Outlier of 500 Mbps Omitted') + theme_bw()

box7 = box6 + theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
                    plot.title=(element_text(face='bold',
                                            size=15,
                                            hjust=0.5)),
                    plot.subtitle = (element_text(hjust = 0.5, vjust = 2)),
                    plot.caption=element_text(size=8, vjust=-3),
                    text=element_text(color='gray25', size = 11),
                    axis.line.y=element_line(color = 'black'),
                    axis.line.x=element_line(color='black'),
                    axis.text=element_text(size=8),
                    axis.text.x = element_text(angle = 0, vjust = 0.7),
                    axis.title.x = element_text(vjust = -2, size = 10),
                    axis.title.y = element_text(size = 10, vjust = 2.5))
box7

This scatterplot indicates that GorgeNet is the only ISP whose advertised download speeds closely match the actually provided speeds. Century Link provides slower upload speeds than the ones it announces, although in some cases, download speeds are closed to the publicized speeds. Most Wave’s customers in the dataset receive slower download speeds than they subscribed for; a couple of them get speeds very close to their subscription speeds, and a couple more get faster speeds than the advertised ones.


Go to table of contents.

Multivariate Plots

We created two types of multivariate plots, a dot and whisker plot and a correlation plot.

The dot and whisker Plot describes two models of regressions. The first model determines whether actual upload and download speeds are related to customer’s satisfaction with speed. The second model includes subscribed speeds as third variable. The dots represent the coefficientes and the whiskers the confidence intervals.

The correlation plot looks at the relationship between overall satisfaction with Internet services and other explanatory variables.

Research Questions
  1. “Is there a relationship between customer satisfaction with Internet speed and download speed, upload speed and subscribed speed?
  2. If not, what variables are correlated to overall customer satisfaction with Internet services?

We expected to find a positive relationship between customer satisfaction with Internet speed and download speed, upload speed and subscribed speed.

Explanation: Dot and Whisker Plots

We can create the dot and whisker plot using the following steps. Additionally, notes for each of the steps are included in the code.

  1. Load the data
  2. Call dotwhisker, broom, and dplyr from library
  3. Create the first model of regression with the relevant variables (Download.Speed and Upload.Speed regressed onto Satisfaction.Speed)
  4. Tidy and mutate the first model
  5. Create the second model of regression with the relevant variables (Download.Speed, Upload.Speed, and Subscribed.Speed regressed onto Satisfaction.Speed)
  6. Combine the models
  7. Plot the models, add theme information, and add titles
  8. Visualize our plot
# Here, we called three libraries that would help us make dot and whisker plots as well as 
library(dotwhisker)
library(broom)
library(dplyr)

# We created the first model of regression for Download and Upload Speeds on Satisfaction with Speed
model1=lm(Satisfaction.Speed~Download.Speed+Upload.Speed,data=Internet2[,-1])
summary(model1)
## 
## Call:
## lm(formula = Satisfaction.Speed ~ Download.Speed + Upload.Speed, 
##     data = Internet2[, -1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4111 -0.4159  0.4289  0.5825  1.5775 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.395876   0.219081  15.501 2.49e-14 ***
## Download.Speed 0.001154   0.008049   0.143    0.887    
## Upload.Speed   0.002786   0.007791   0.358    0.724    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.043 on 25 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.1219, Adjusted R-squared:  0.05169 
## F-statistic: 1.736 on 2 and 25 DF,  p-value: 0.1968
# Here we tidied up and mutated the model of regression so thatwe add a column, "model," with the values of Model 1. 
model1_t = tidy(model1) %>% mutate(model = "Model 1") %>%
   relabel_predictors(Upload.Speed = "Upload",
                     Download.Speed = "Download",
                     Subscribed.Speed = "Subscribed")
# We then created a second model of regression, this time with a fourth variable: Subscribed Speeds. 
model2=lm(Satisfaction.Speed~Download.Speed+Upload.Speed+Subscribed.Speed,data=Internet2[,-1])
summary(model2)
## 
## Call:
## lm(formula = Satisfaction.Speed ~ Download.Speed + Upload.Speed + 
##     Subscribed.Speed, data = Internet2[, -1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3054 -0.3388  0.2492  0.6909  1.6888 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.1851440  0.3286583   9.691 1.38e-09 ***
## Download.Speed    0.0010622  0.0098589   0.108    0.915    
## Upload.Speed     -0.0001521  0.0093008  -0.016    0.987    
## Subscribed.Speed  0.0064991  0.0082764   0.785    0.440    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.069 on 23 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.145,  Adjusted R-squared:  0.03353 
## F-statistic: 1.301 on 3 and 23 DF,  p-value: 0.2981
# We tidied up and mutated the second model of regression to have a column called "model," with values of "Model 2"
model2_t <- tidy(model2) %>% mutate(model = "Model 2") %>%
   relabel_predictors(Upload.Speed = "Upload",
                     Download.Speed = "Download",
                     Subscribed.Speed = "Subscribed")
# Here we combined all the models 
allModels=rbind(model1_t, model2_t)
# Here we plot the coefficients and their confidence intervals for all models and add or change theme/text elements 
dwplot(allModels) + 
  geom_vline(xintercept = 0, 
               colour = "grey 50", 
               linetype = 2) + 
  scale_colour_grey(start = .1, end = .7) + 
  labs(y = "Independent Variables",
title = "Relationship Between Speed Satisfaction & 
Download, Upload, and Subscribed Speed", 
         caption = "Source:Stevenson Business and Organization Internet Survey, 2018") +
  theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
                    plot.title=(element_text(face='bold',
                                            size=15,
                                            hjust=0.5)),
                    plot.subtitle = (element_text(hjust = 0.5, vjust = 2)),
                    plot.caption=element_text(size=9, vjust=-2),
                    text=element_text(color='black', size = 11),
                    panel.background = element_rect(fill='white'),
                    legend.title = element_blank(),
                    axis.line.y=element_line(color = 'black'),
                    axis.line.x=element_line(color='black'),
                    axis.text=element_text(size=8))

Because all of the models’ confidence intervals overlapped with the value of 0 on the x-axis, we cannot conclude that the coefficients are actually different from 0 and, as a result, neither model identifies a relationship between experienced speeds or subscribed speed and customers’ satisfaction with their speed.

Explanation: Correlation Plots

Now that we’ve seen that broadband speed satisfaction is not necessarily related with download, upload and subscribed speed, we can look at other explanatory variables for relationships. A good way to do this is through a correlation plot.

#The first step is to create a new dataframe with the variables of interest

#Adding data to dataframe and removing outlier observation in row 3
df1 <- Internet2[-3,]

#Adding relevant variables to new dataframe
df.corr <- df1[,c('ISP','Cost','Subscribed.Speed','Download.Speed','Upload.Speed', 'Satisfaction.Speed', 'Satisfaction.Price','Satisfaction.Reliability','Satisfaction.Customer.Service','Satisfaction.Overall','Ease.Streaming','Ease.Downloading','Speed.Difficulty')]
#We need to convert the values for satisfaction ('Dissatisfied','Satisfied', etc.), cost, and ease ('Easy','Difficult', etc.)  into numeric factors so we can plot them compare them to each other
df.corr$Cost <- as.integer(as.factor(df.corr$Cost))
df.corr$Satisfaction.Price <- as.integer(as.factor(df.corr$Satisfaction.Price))
df.corr$Satisfaction.Reliability <- as.integer(as.factor(df.corr$Satisfaction.Reliability))
df.corr$Satisfaction.Customer.Service <- as.integer(as.factor(df.corr$Satisfaction.Customer.Service))
df.corr$Satisfaction.Overall <- as.integer(as.factor(df.corr$Satisfaction.Overall))
df.corr$Ease.Streaming <- as.integer(as.factor(df.corr$Ease.Streaming))
df.corr$Ease.Downloading <- as.integer(as.factor(df.corr$Ease.Downloading))
df.corr$Speed.Difficulty <- as.integer(as.factor(df.corr$Speed.Difficulty))

Now we can plot this new dataframe (df.corr)

#Calling up the appropriate library
library(GGally) 

#PVisaulizing onto a correlation plot
corrplot = ggcorr(df.corr[,-1], # all but the first column
       hjust = 0.9,# distance to plot (diagonal)
       size=3, # font size
       layout.exp=3, # width so that variable names are shown
       low = 'red',high = 'blue') # color scale

#Adding titles
corrplot1 = corrplot + labs(title='Correlation Plot for Broadband Satisfaction',
                   caption='Source: Stevenson Business and Organization Internet Survey, 2018')

#Adjusting the theme
corrplot2 = corrplot1 + theme(plot.title=(element_text(face='bold', #Bold title
                                            size=15, #Font size
                                            hjust=1)), #title Horizontal alignment
                              plot.caption=(element_text(hjust=1))) #Caption horizontal alignment

corrplot2

Here, higher positive correlation will apear more blue (1) while negative correlations will appear more organe (-1). While this plot is helpful, the multiple variables that measure satisfaction adds confusion. Instead, we can look at just ‘Overall Satisfaction’ and see if it is associated with any of the explanatory variables.

#Creating a new dataframe with only one variable measuring satisfaction
df.corr2 <- df.corr[,c('ISP','Satisfaction.Overall','Cost','Subscribed.Speed','Download.Speed','Upload.Speed','Ease.Streaming','Ease.Downloading')]

#Same steps as the previous plot, but with the new dataframe (df.corr2)
corrplot3 = ggcorr(df.corr2[,-1],
       hjust = 0.9,
       layout.exp=3, 
       size=3, 
       low = 'red',high = 'blue') 

corrplot4 = corrplot3 + labs(title='Correlation Plot for Broadband Satisfaction',
                             subtitle='Overall Satisfaction Only',
                   caption='Source: Stevenson Business and Organization Internet Survey, 2018')

corrplot5 = corrplot4 + theme(plot.title=(element_text(face='bold',
                                            size=15,
                                            hjust=1)),
                              plot.subtitle=(element_text(hjust=.6)),
                              plot.caption=(element_text(hjust=1, vjust = -1)))

corrplot5

This plot tells us more about the relationship between Satisfaction and the other variables. The cell with the largest negative correlation is the interaction between ‘Satisfaction.Overall’ and ‘Cost’; this indicates that people who pay more for their internet service tend to be less satisfied. The greatest positive correlation is between ‘Satisfaction.Overall’ and ‘Ease.Downloading’, which indicates that people who rate the difficulty of downloading lower are more satisfied overall with their ISP.

Map of Washington State Fixed Broadband Service

Research Question

How does fixed broadband service vary across the state of Washington?

Explanation

Below, we create a map of broadband service speeds in Washington, at the Census block code unit of analysis.


Go to table of contents.